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ABSTRACT 

MEME and many other popular motif finders use 
the expectation-maximization (EM) algorithm to 
optimize their parameters. Unfortunately, the 
running time of EM is linear in the length of the 
input sequences. This can prohibit its application 
to data sets of the size commonly generated by 
high-throughput biological techniques. A suffix 
tree is a data structure that can efficiently index a 
set of sequences. We describe an algorithm, Suffix 
Tree EM for Motif Elicitation (STEME), that approxi- 
mates EM using suffix trees. To the best of our 
knowledge, this is the first application of suffix 
trees to EM. We provide an analysis of the 
expected running time of the algorithm and demon- 
strate that STEME runs an order of magnitude more 
quickly than the implementation of EM used by 
MEME. We give theoretical bounds for the quality 
of the approximation and show that, in practice, 
the approximation has a negligible effect on the 
outcome. We provide an open source implementa- 
tion of the algorithm that we hope will be used to 
speed up existing and future motif search 
algorithms. 

INTRODUCTION 

Reverse-engineering transcriptional regulatory networks is 
a major challenge for today's molecular cell biology. 
High-throughput methodologies such as ChlP-seq (1), 
ChlP-chip (2) and DamID (3) are generating ever larger 
data sets on the binding locations of transcription factors 
(TFs). However, the resolution of these techniques is still 
an order of magnitude or two larger than a typical tran- 
scription factor binding site (TFBS) (4). There remains a 
need to determine the binding sequence preferences of TFs 
and hence the exact locations of TFBSs from these data 
sets. For example, knowledge of these sequence 



preferences can be used to computationally predict 
binding sites in different cell types or in different organ- 
isms. In addition, understanding the exact locations of 
TFBSs for cooperating TFs can help us understand com- 
binatorial transcriptional regulation (5). This task of 
inferring the sequence preferences of a TF is termed 
'motif finding'. 

A typical high-throughput experiment might generate a 
data set of thousands of sequence fragments. Each 
fragment could be hundreds of base pairs long. The 
sequence preferences of a TF are relatively short, typically 
8-12 bp. Mismatches to the preferred bases are common in 
TFBSs. Determining these sequence preferences from the 
few binding sites in the fragments is a difficult problem. 
However, much effort has been dedicated to this motif 
finding problem and many algorithms and softwares 
exist for this purpose. The area has been reviewed 
several times (6-9). 

Most motif finders can be broadly categorized as either 
combinatorial or probabilistic. Combinatorial motif 
finders search for consensus sequences. TFBSs are pre- 
dicted on the basis of the number of mismatches with 
these consensus sequences. Probabilistic motif finders 
infer position weight matrices (PWMs) that specify a dis- 
tribution of bases for each position in a TFBS. PWMs are 
more flexible models of TFBSs than consensus sequences 
and are typically preferred. Most of the probabilistic motif 
finders use either the expectation-maximization (EM) al- 
gorithm (10) or a Gibbs sampling algorithm (11) for in- 
ference. Examples of motif finders that use the EM 
algorithm include Refs (12-20). 

The volume of available TF binding location data is 
rapidly increasing. Both the number and the size of data 
sets generated by techniques such as ChlP-chip, ChlP-seq 
and DamID continues to grow. Unfortunately, the 
runtime of most motif finders is at least linear in the size 
of the data. In our experience, most motif finders are far 
too slow for such large data sets of sequences. While it 
may be possible to let the motif finder run for several days, 
invariably the user would like to fine-tune parameters. 
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This may involve several runs which makes motif finding 
impractical. 

MEME (13) is one of the most popular motif finders. It 
has a long pedigree: the original version was published in 
1994. MEME was one of the best performing motif finders 
in a comparative benchmark review (21). MEME has a 
large user-base that understand its parameters and trust its 
results: the primary paper describing its algorithm is cited 
more than 300 times on PubMed. Unfortunately, MEME 
takes a prohibitively long time to run on large data sets. 
The MEME authors acknowledge this and recommend 
discarding data from large data sets in order to make 
runtimes practical. They suggest a limit of 200 000 bp on 
the size of input data set. In our experience, the users of 
MEME are not always aware of this advice and can be 
frustrated when using MEME on large data sets. In any 
event, discarding data is a far from ideal work around as it 
necessarily detracts from the power of the method. Hence 
there is a need to make MEME and other motif finders 
more efficient. This article focuses on speeding up the EM 
algorithm that is a core component of MEME and many 
other motif finders. 

Various attempts have been made to speed up MEME 
in recognition of its poor performance on large data sets. 
The authors of MEME have implemented a parallel 
version of MEME, ParaMEME (22). Other approaches 
use specialized hardware such as parallel pattern 
matching chips on PCI cards (23) or off-loading the com- 
putations onto powerful GPUs (24). All these techniques 
require hardware that is not commonly available to the 
typical researcher. 

In this article, we propose an alternative route to accel- 
erate MEME by using suffix trees. A suffix tree (25) is a 
data structure that represents a sequence or set of se- 
quences. Suffix trees are well suited to algorithms that 
require efficient access to subsequences by content rather 
than by position. They have been used in several areas of 
bioinformatics: sequence alignment (26), indexing 
genome-scale sequences (27) and short read mapping 
(28). They have also been used for combinatorial motif 
finding (29-31) and scanning for PWMs (32). To the 
best of our knowledge, the work presented here is the 
first application of suffix trees to probabilistic motif 
finding and the EM algorithm in particular. 

In the 'Materials and Methods 1 section, we describe 
MEME's probabilistic model and how MEME uses the 
EM algorithm to optimize its parameters. We describe 
an approximation to EM and show how suffix trees can 
be used to implement this approximation (the STEME 
algorithm). We analyse the expected efficiency gains we 
expect to achieve with this approximation. We describe 
our open source implementation of the STEME algo- 
rithm. In the 'Results' section, we describe the tests we 
undertook to establish the accuracy and efficiency of 
STEME in practice. We examine the effect of varying 
the motif width and the main parameter in our algorithm 
on the accuracy and efficiency. In the 'Discussion' section, 
we look at the implications of the results and suggest how 
our algorithm can be best used. We conclude with an 
outlook for future work. 



MATERIALS AND METHODS 

MEME 

MEME uses the EM algorithm to improve a model of 
the motif iteratively. In each iteration, the locations of 
the binding sites are estimated using the current 
model of the motif and the motif is updated using 
the predicted sites weighted by their likelihoods. The 
EM algorithm is guaranteed to converge to a local 
maximum of the likelihood function but is very sensitive 
to initial conditions. To mitigate this sensitivity, MEME 
runs the EM algorithm many times from different starting 
points. 

MEME's model. For a particular motif width, W, MEME 
treats every subsequence of length W (henceforth Winter) 
in the data independently. Given a motif width, W, 
MEME models each W-mer in the sequences as an inde- 
pendent draw from a two-component mixture. One 
mixture component models the background sequence 
composition, the other models binding sites. The binary 
latent variables, Z = \Z\, . . . , Z N }, indicate whether each 
W-mer, X n , is drawn from the background component or 
the binding site component. MEME has several different 
variants of its model which the user can choose between. 
They vary in how the sites are distributed among the se- 
quences. The oops variant insists that there is exactly One 
Occurrence Per Sequence. For most experimental data, 
this is not a realistic assumption and those sequences 
that do not contain a site can reduce MEME's ability to 
find the motif. The zoops variant allows Zero or One 
Occurrences Per Sequence. This is more plausible for 
most experimental data sets but will not take statistical 
strength from more than one site in a sequence. The anr 
variant allows any number of binding sites in each 
sequence. This variant is the most flexible and is the 
most suitable for most applications. However, it is also 
the most computationally demanding: care must be 
taken in the algorithm when sites overlap otherwise 
MEME will tend to converge on self-overlapping motifs. 
This is because MEME's assumption that the W-mers are 
independent breaks down as each W-mer will overlap with 
up to 2(W— 1) other Homers. Nevertheless, as homotypic 
clusters of binding sites are common in transcriptional 
networks, we will focus on this anr variant in the rest of 
this article. 

In the anr variant, the background component is 
modelled using a Markov model parameterized by 0 BG , 
the binding site component is modelled by a PWM 
parameterized by 8 BS , and X parameterizes the probability 
that any given W-mer is drawn from the binding site com- 
ponent. Thus the model is 

p(Z n = l\X) = k (1) 

P(x n \z n , e BG , e BS ) = P (x„\e BS ) z "p(x„\e BG ) l ~ z " (2) 

where {X u . . . , X N } are the W-mers and {Z\, . . . , Z N ) are 
latent variables indicating whether the W-mers are drawn 
from the background or binding site model. This gives the 
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Figure 1. MEME's model: k, the prior probability of a binding site; 
Z„, the hidden variable representing whether the n-th IV-mer is an 
instance of the motif; X„, the n-th W-mer; Q BS , the parameters of the 
motif; Q BG , the parameters of the background distribution. 



joint distribution 

P (x, z\x, e BG , 9 BS ) 



= \\p(Z n \X)p(X n \Z n , 

#BG, $Bs) 

;;=1 

N 

= n^( Z "l 0 Bs)] Z "[(l - ^Wbg)]'" 2 " 



(3) 



ll=\ 



The model is depicted in plate notation in Figure 1. 

EM. In the E-step of EM, MEME derives the expected 
value of the log likelihood, LL, w.r.t. the latent variables, 
Z, given the current parameter estimates, 6 = {0bs> 
6 BG , X}. All expectations, (.)z\q, are w.r.t. Z|0 unless 
specified. 



(LL) = (logp(X,Z\X,6 BG ,6 BS )) 

N 

= J2{Zn)log[Xp(X„\9 BS )] 



(4) 
(5) 



+ (l-<Z fl »log[(l-A)X^„|0BG)] 

From Equation (3) and an application of Bayes' theorem 
^(^I^bs) 



(Z„) = 



Xp(X n \6 BS ) + (\ - X)p(X„\0 BG ) 



(6) 



The M-step maximizes the expected log likelihood w.r.t. 
each parameter in turn to calculate their new estimates. 
On inspection of (5), we can see 



X^> argmax ^(Z„) log A + (1 - (Z„))log(l - A) 



N 



(7) 



# BS >-^ argmax ^{Z„>logX^»IM (8) 

Bus „ 

0 BG -> argmax ^(1 - (Z„» log X*«|0bg) (9) 

MEME uses a PWM model for binding sites where 
9bs = {fwb}- fwb parameterizes the probability of seeing 
base b at position w in a TFBS. 



p(X n \9 BS ) = Y[fw> 



(10) 



Here X nw is the w-th base of the «-th W-mer. The update 
equations are 



fwb 1 



Z n (Z n )I(X„ w = b) = c wh 



(11) 



where c,,,/, is the expected number of times we see base b at 
position w in a binding site and 5 is the expected number 
of binding sites. 

By default, MEME uses a 0-order Markov model for 
6 BG . This is updated by the expected counts of the bases 
which are not in binding sites. 

At the end of each iteration of EM, MEME adjusts its 
estimates for the Z„. This accounts for the fact that the 
model does not prohibit overlapping binding sites. By way 
of explanation, suppose that there are 12 consecutive 'A's 
in the data and the current estimate of the motif models 
binding sites of eight consecutive 'A's. Without this ad- 
justment, MEME would assign (Z„> w 1 to the 5 W^-mers 
in the consecutive 'A's. The sum of the (Z n ) represents the 
number of binding events we expect in that window. 
Sterically, five TFs cannot bind to sites of width 8 in a 
12 bp window. MEME's algorithm leaves the highest (Z„) 
unchanged and scales the others down so that they sum to 
at most 1. Without this adjustment, repetitive sections in 
the input sequences can cause MEME to converge on 
motifs of low complexity that have frequently overlapping 
binding sites. 

Expected running time. Each iteration of MEME's EM 
algorithm evaluates the current estimate of the motif on 
each W-mer. The algorithm to adjust for overlaps also 
runs in O(NW) time hence an iteration of EM completes 
in 0{NW) time. However, it should be noted MEME's 
algorithm as a whole is quadratic in jV as the number of 
seeds is proportional to N. 



Approximation to EM 

The updates in the M-step of the EM algorithm all involve 
sums of the form £ n (Z n ) ■ ■ ■ where n ranges over all 
W-mers in the data set. In any given iteration of EM, 
depending largely on the current 8 BS , most of these (Z„> 
will be negligible. We can make an approximate M-step by 
ignoring those n for which (Z„) is small. We formalize this 
by defining a subset of the n thresholded by (Z„): 



{« : (Z„> > 8, 1 < n < N] 



(12) 



Intuitively, 7g indexes those W-mers that match our 
current motif estimate.^ As an approximation to 
Equation (11), we define /„./,, c,,./,, S 



Jul 



For 



2ZneT s (Zn)KX mv = b) = £ H , h 
EiieTs(Zn) S 

convenience 



(13) 



of notation, we define 
Cwb = Y,,vjtT s ( Z n}KX m , = b), S= E^r s ( z »l . and 

N = N - | T s \ so that c wb = c wb + c wb , S= S + S and 
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N = | Tg | + N. The relative error, eg, in our approximation 

fwb off wb is 



eg 



fwb~ fwb Cylb s Cwb 



fwl 



Cwb 



_ S V . S„ 

Cwb^S = C w b — I — — 1 I CnA = Cm* — — Cw6 

Noting that 5 and all the counts, c, are positive 



\e s \ < max 



Cwb Sc w fr 
Cwb Sc wb 



(14) 



We know from the definition of Tg that S < SN and that 

Cwb < SN 



\e s \ < max 



57V SN\ SN 

Cwb Si Cwb 



(15) 



So <5 > ' 6s x ■ If we knew c wh we would know which 8 
would guarantee our desired bound in the motif estima- 
tion relative error. Unfortunately, at the beginning of an 
EM iteration when we want to choose 8 we do not know 
c W b- in practice, this is not a problem. Given X and the 
current motif parameters, we can estimate c wh fairly ac- 
curately. When the EM iteration has completed, c„b is 
available. We can check the above equations to ensure 
that the relative error is less than |eg|. If it is not, we can 
calculate a new 8 for which the relative error is guaranteed 
to be small enough and re-run the iteration. In our tests, 
this was never necessary. Also c„,/> tends to change slowly 
over iterations, making its estimation straightforward in 
all but the first iteration. 



Suffix trees 

A suffix tree is a data structure that stores a sequence or a 
set of sequences. Typically, sequences are stored as con- 
tiguous buffers. This permits fast access to subsequences 
indexed by their position. Suffix trees are alternative data 
structures that allow efficient access to subsequences by 
their content. Re-writing algorithms to use suffix trees 
can often achieve significant efficiencies. 

Suppose we have a sequence, Y = y x . .. y T . A suffix of Y 
is any subsequence, y t . . . y T , that ends at y T . A suffix tree 
stores every suffix of the given sequence(s) in a tree struc- 
ture. An example of a suffix tree is shown in Figure 2. 
Now we show how to iterate over all the subsequences 
of length W in Y (the Warners). Each such W-mer is the 
start of a suffix. Hence, descending the tree to depth W 
iterates over all the W-mers. If two W-mers have the same 
content, they will be represented once by the same path in 
the tree. Contrast this with the random access of a typical 
contiguous buffer data structure for sequence storage. A 
contiguous buffer permits fast random access to a W-mer 
at a given position but takes no account of identical or 
similar Warners. If we have an application where we are 
not interested in the position of the W-mers, a suffix tree 
can be a more efficient data structure to iterate over them. 
Another attractive property of suffix trees is that they can 
be constructed in linear time and space (33). 




Figure 2. A suffix tree that represents the sequence 'BANANA'. The 
beginning of the sequence is represented by the symbol a and $ is a 
termination symbol. Note that the subsequence 'ANA' occurs twice in 
the sequence but is represented once in the tree. 



Suffix tree EM efficiencies. The EM algorithm must visit 
every W-mer in the sequences once on each iteration. By 
using a suffix tree to enumerate all the W-mers, we imme- 
diately achieve two efficiency improvements. First, if any 
two W-mers are identical, the (Z n ) calculations are equiva- 
lent and we do not repeat them. Second, as we descend the 
suffix tree to enumerate the W-mers, we make partial 
evaluations of our current motif on what we have seen 
of the W-mer so far. These partial evaluations are 
shared across all the W-mers below the current node in 
the tree. In contrast, MEME evaluates every base in each 
W-mer once. 



Branch-and-bound 

Recall from Equation (12) that we need to identify all n 
with (Z„) > 8 for a given 8. We iterate over the Homers by 
descending the suffix tree. Suppose we have an upper 
bound on the (Z„) of all the W-mers below any node. If 
this bound is below 8, then we can ignore the entire branch 
of the suffix tree below the node. In this way, we avoid 
evaluating large parts of the tree that do not fit the current 
estimate of the motif well. We illustrate the idea in 
Figure 3. 

We define X%~ as the prefix of X„ of length w and X„ + 
as the suffix of length W- w (so that X„ = Z];'-^; + ). We 
can write the likelihoods of the X„ in terms of their prefixes 
and suffixes 

p(x n \6 BS ) = p(x>-\9 BS )p(x;; + \9 BS ) 
P(x n \e BG ) = p(x;-\9 bg )p(K; + \6 bg ) 

We can enumerate the W-mers in the data by descending a 
suffix tree. Each node we visit represents the prefix of all 
the W-mers below it. Given our binding site and back- 
ground models, we can calculate the p(X^~\9 BS ) and 
p{X^\6 BG ) exactly. Suppose we can also bound 
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Figure 3. An illustration of how the STEME branch-and-bound algo- 
rithm works. Top: the current estimate of the motif in the EM algo- 
rithm. This is actually the motif for Stat5 from the TRANSFAC 
database (M00223). Bottom: part of the suffix tree representing the 
sequences. We can see that if we have descended the tree to the node 
that represents the prefix, GCAT, our match to the motif is poor. If the 
bound for the (Z„) of all the nodes below this is small enough, we can 
stop our descent here. 



p(X^ + \0^s) from above and p{X^ + \0 B o) fr° m below. 
Recalling (6) we can use these bounds to bound (Z„) 
above. In more detail, suppose p(X% + \9ns) < p(^n + \&Bs) 
and p(X^ + \d B a) > p(X% + \&bg) then using 0 as a lower 
bound for />(^,' + |#bs) we have 



(Z„) < <Z„ 



xp(r:-\e BS ) P (x-+\d BS ) 
(l-xM^-|fe G M*ri0 B G) 



(16) 



The upper bounds, piX'^+lO^s) are easy to calculate 
from 9 BS . In practice, the background model does not 
change very much over the course of the EM algorithm 
as only a small fraction of the base pairs are explained as 
binding sites. Therefore, we keep the background model 
fixed and pre-compute the lower bounds, p(X* + \9 B g), i n 
an initialization step. 

Expected efficiencies 

In order to understand the computational savings this ap- 
proximation makes, we give an analysis of a simplified 
example. We investigate the expected fraction of nodes 
we ignore at each depth in our descent of the suffix tree. 

Suppose our current estimate of the PWM has a 
preferred base at each position. Each preferred base has 
probability a and the other three bases at each position are 
equally likely with probability ^f-. When a = 1, our PWM 
is equivalent to a consensus sequence, when a = A our 
PWM has a uniform distribution. As c„,/, « XNa we set 
8 = eXa where e is the maximum relative error we will 
tolerate. Suppose also our background model is a 



uniform 0-order Markov model, then p(X n \Q B c) = 

As 1 « 1 —X and recalling (16), we want to know when 

the following holds 



(Z n ) < (Z„) 



Xp(X^\6 BS )a 

A-W 



W-w 



< S = eXa 



Let Y be the number of preferred bases in X%~ '. 
Assuming that X%~~ is drawn from our background 
distribution, we have Y ~ Binomial(tr, 4). Now 



iogx*ri0Bs) 

whenever 



Y\oga + (w- Y)\og^ 



Y< 



loge- Wlog4-(W- l)loga 
log a - log( 1 - a) + log 3 



Hence 



(17) 



we can ignore all the nodes with prefix X^ . For any given 
values of e, W and a, the expected fraction of nodes 
ignored at depth w is the probability that Equation (17) 
holds. As Y is distributed according to a binomial distri- 
bution, these values can be read directly from the binomial 
cumulative distribution function. We plot these expected 
fractions for some parameter values in Figure 4. 



Open source implementation 

We have implemented the STEME algorithm in C++ as 
an open source library. For the suffix tree implementation, 
we used the SeqAn library (34). In addition to the C++ 
interface we have implemented a Python scripting inter- 
face to make it more accessible. The codes are tested on 
Linux with GCC 4.4 and Python 2.6 but should work with 
any modern C++ compiler and version of Python 2 newer 
than 2.5. Our implementation is available at http://sysbio. 
mrc-bsu.cam.ac.uk/johns/STEME/. Our implementation 
requires 500 Mb of memory to work with data sets of up 
to 1 3 Mb, which is well within the range of modern 
desktop or laptop machines. Building the suffix tree for 



1.0 



O-O a=0.7W= 
O-O a=0.6W= 
□-□ a=0.7W= 
□■□ a=0.6W= 




- 0.2 



Figure 4. The probability of discarding a W-mer drawn from a 
uniform 0-order Markov background at different depths, w, in the 
suffix tree. Here we used e - 0.4. As explained in the text, a represents 
how sharp the current estimate of the motif is. The higher a is, the 
sharper the motif. Examining the graph reveals that with a moderately 
sharp motif (a = 0.7) of width 8, we can expect to discard over half the 
nodes in the tree at depth w = 6. 



el26 Nucleic Acids Research, 2011, Vol. 39, No. 18 



Page 6 of 10 



such a data set takes 19 s on our laptop. These space and 
time requirements scale linearly in the size of the input. 

Test data sets 

We used data from two sources for our tests (Table 1): a 
set of six smaller ChlP-chip and ChlP-seq data sets we had 
previously worked with (35); and five larger data sets from 
the ENCODE project (36). 

The six data sets we had previously worked with were 
prepared as follows. The data for Spl were extracted from 
TRANSFAC Professional 11.4 and the flanking bases 
added by TRANSFAC were removed. The data sets for 
GABP, Statl, Stat5a and Stat5b were processed to extract 
the binding site sequences using the cisGenome software 
suite vl.O (41). In every case, both sequences and controls 
were used. Binding region boundary refinement was used 
and then the region extended on each side by 30 bp. GABP 
peaks were selected if there were more than 1 8 reads in a 
rolling 100 bp sequence window compared with the 
control. This higher figure was selected to remove 
visually noisy peaks and 10 767 peaks were detected. 
Cut-offs of 30 and 20 reads were used for the Stat5a and 
Stat5b data, respectively, yielding 814 and 154 peaks. 
RepeatMasker was used on all the test data sets to mask 
repetitive elements using the genomic context for each 
sequence. We provide the sequences as part of the 
Supplementary Data. These files give data for sequences 
and genomic coordinates. The results in the article are 
based on the masked data, but the unmasked data are 
given for completeness. The sequences are given in 
FASTA format and notes about the files for genomic co- 
ordinates (including assembly versions) are given within 
the files. 

The five larger data sets from the ENCODE project 
were produced by the Myers Lab at the HudsonAlpha 
Institute for Biotechnology. We downloaded the data for 
SRF, ZBTB33, RXRA, TCF12 and CTCF from the 
ENCODE Data Coordination Center at UCSC. 

Tests 

In order to test the accuracy and efficiency of the STEME 
approximation, we ran our STEME implementation and 
MEME's EM implementation to completion on the data 
sets. 



Table 1. The test data sets 



TF 


Sequences 


Base pairs 


Publication 


Stat5b 


144 


19 379 


(37) 


Stat5a 


737 


94250 


(37) 


Spl 


296 


207 325 


(38) 


GABP 


2275 


500 203 


(39) 


Statl 


2360 


500409 


(40) 


SRF 


2155 


674443 


(36) 


ZBTB33 


3342 


1 589 893 


(36) 


RXRA 


19126 


8 118061 


(36) 


TCF12 


35 714 


12 540202 


(36) 


CTCF 


41069 


13214001 


(36) 



We wanted to try a range of typical parameters so we 
ran MEME's seed searching algorithm with the default 
arguments. We used motif widths of 8, 11, 15 and 20. 
The number of site parameters took values of 2, 4, 8, 16, 
32, 64, 128, 256 and 500. MEME uses the number of sites 
parameter to initialize k and also to look for the best seed 
(consensus sequence) for the motif. This gave us 6 data 
sets, 4 motif widths and 6 different number of sites par- 
ameters for a total of 144 separate test cases. Additionally, 
we wanted to test the effect of varying the permitted 
relative error so when we ran STEME, we used es of 0, 
0.2, 0.4, 0.6 and 0.8. 

Once we had run the test cases, we needed some way of 
comparing the results of the different implementations 
and the different settings for the permitted relative error, 
e. Comparison of the resulting PWMs would have been 
possible but we chose to perform a simplified analysis by 
converting the resulting PWMs into consensus sequences 
and using the Hamming distance as a distance metric. To 
test the accuracy of the STEME approximation, we 
calculated two statistics: the mismatch rate, that is, how 
often the resulting consensus sequences from the same 
starting point differed in any base; and the mismatch 
fraction, that is, what proportion of the bases of the re- 
sulting consensus sequences differed. 

We ran the tests using version 4.5.0 of MEME which 
was released on 8 October 2010. We modified the MEME 
source code in order to obtain precise timing information 
for its EM algorithm. The modifications are available as a 
patch included with the STEME source code. 



RESULTS 

How e affects STEME's accuracy 

We compared the accuracy of STEME when using differ- 
ent bounds on the relative error, e. When e = 0, no ap- 
proximations are made and we used this as a baseline 
for comparison. The average mismatch rate and 
mismatch fraction statistics are plotted in Figure 5. Even 
when a very large relative error of 0.8 is permitted, only 1 
in 6 of the resulting consensus sequences differ and less 
than 1 in 20 bases differ. When using a reasonable value of 
e = 0.4, only around 1 in 8 of the test cases differed from 
the baseline and only 1 in 30 of the resulting bases differed. 

STEME's accuracy relative to MEME 

We also analysed the accuracy of STEME relative to 
MEME. We had hoped that the STEME algorithm with 
the approximation turned off (e = 0) would produce iden- 
tical results to MEME. For reasons we present in the 
'Discussion' section, this is not the case. These results 
are presented in Figure 6. When e = 0, less than 1 in 4 
of the test cases had a different outcome but only about 1 
in 20 of the bases in the resulting consensus sequences 
differed. As an example, when the seed 'ATCCTGTTC 
TC is used with 16 sites on the Spl data set, MEME 
converges to 'CTTCCTTCTCT' and STEME converges 
to 'CTCCCTTCTCT'. 
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Figure 5. An analysis of how increasing the permitted relative error, e, 
affects the outcome of STEME. The STEME algorithm was run from 
the initializations described in the text for various values of e. (A) The 
mismatch rate: The fraction of resulting consensus sequences that 
differed from those when e = 0. (B) The mismatch fraction: The 
fraction of bases in the resulting consensus sequences that differed 
from those when e — 0. 




Figure 6. An analysis of the accuracy of STEME for various values of 
e relative to MEME. The STEME algorithm was run from from the 
initializations described in the text for various values of e. (A) The 
mismatch rate: the fraction of resulting consensus sequences that 
differed from the results of MEME. (B) The mismatch fraction: the 
fraction of bases in the resulting consensus sequences that differed from 
the results of MEME. 



Efficiency 

We compared the runtime for an iteration of STEME to 
an iteration of the MEME EM algorithm. The relative 
speeds are dependent on the value of e chosen and on 
the width of the motif, as shown in Figure 7. STEME is 
significantly quicker than MEME for reasonable values of 
e and typical motif widths. 

DISCUSSION 

Accuracy 

Examining Figure 5 we can see that when e = 0.4 about 1 
in 8 of our applications of EM had some discrepancy with 
the exact algorithm and about 1 base in 30 differed 
overall, in our experience, this represents a satisfactory 
compromise of speed and accuracy, in any case, it is not 
clear if all the differences introduced by the approximation 
have a negative effect. Our approximation ignores those 
putative binding sites that are not a good match to the 
motif rather than discounting them, ft could be that by 
only examining the higher quality binding sites, our algo- 
rithm builds a better model of the motif. We hope to in- 
vestigate this possibility in further work integrating our 
STEME algorithm in a motif finder. 



We also compared STEME without any approximation 
to MEME's EM implementation (Figure 6). We had 
hoped the implementations would agree. Unfortunately, 
there were some discrepancies. We spent some time reverse 
engineering the MEME source code and discovered some 
inconsistencies between the published MEME algorithm 
(42) and the latest implementation. In particular, the 
handling of reverse complements is not discussed in the 
published algorithm. STEME treats each draw as a 50/50 
mixture between a binding site on the positive strand and 
a binding site on the negative strand. The MEME imple- 
mentation essentially doubles the size of the data by 
adding a reverse-complemented copy of the data. 
Despite this, STEME and MEME converge on essentially 
the same motifs. On average, only 1 base in 20 differs. 

Interestingly, it appears that there is significant overlap 
between the test cases for which STEME without any ap- 
proximation differs from MEME and those test cases for 
which the result of the STEME changes as the permitted 
error is allowed to grow. This can be seen in Figure 6 as 
the difference between e = 0 and e = 0.4 is smaller than 
the analogous difference in Figure 5. 

Efficiency 

Figure 7 shows that the speed-up possible through the 
STEME approximation is dependent both on the width 
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Figure 7. A comparison of the speed of STEME and MEME on one 
iteration of the EM algorithm. (A) Using e = 0.4 as a typical setting, 
the iteration speeds across all the data sets are plotted on a log 10 scale. 
The error bars represent the standard deviations. (B) A violin plot of 
the relative speeds of MEME and STEME grouped by e. With e = 0, 
STEME can be slower than MEME although we would expect this to 
reverse on larger data sets. As e grows, STEME's advantage grows. The 
contours of the violin plots are kernel density estimates that are 
truncated at the minimum and maximum values. The j-axes are on a 
log 10 scale. (C) Using e = 0.4 as a typical setting, the relative speeds 
grouped by motif width. For motifs of width 8, STEME is between 
10 3 = 2 and 10 21 = 125 times quicker than MEME. 



of the motif considered and the relative error tolerated in 
the estimation of the motif. For motifs of reasonable size 
(W = 8 or 11), an order of magnitude increase in speed 
over MEME can be expected when using a relative error 
of e = 0.4. Our approximation is consistently quicker than 
MEME's implementation of EM which is already highly 
optimized. STEME achieves an order of magnitude 
increase in speed on data sets of moderate size for a 
wide range of reasonable parameters. In the coming 
years, we expect the average size of data sets to continue 
increasing. 
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Table 2. Timings for STEME with search for seeds and complete 
MEME algorithm 



TF 


Base pairs (kb) 


W 


STEME (s) 


MEME (s) 


Speed-up 


SRF 


674 


s 


792 


14 760 


18 


ZBTB33 


1 590 


8 


933 


78 339 


84 


TCF12 


12 540 


8 


2122 


4928 532 


2322 


TCF12 


12 540 


10 


27424 


5176744 


189 


TCF12 


12 540 


12 


379 891 


4 597053 


12 



The times to run MEME on the TCF12 data set are estimated from 
partial runs as otherwise they would have taken months to complete. 



Applicability 

We have not presented a complete motif finder but we 
have shown how any motif finder that uses the EM algo- 
rithm on a compatible model can be adapted to handle 
larger data sets. We would have liked to have presented an 
efficient drop-in replacement for MEME but were pre- 
vented from doing so for some technical reasons that we 
elaborate on here. 

The EM algorithm is not a motif finder on its own. The 
result of EM is dependent on how the parameters are 
seeded. Hence to find motifs, suitable seeds must be 
found. MEME's search for seeds is inefficient on large 
data sets. Integrating our fast EM algorithm with 
MEME's slow search for seeds would offer little benefit 
as runtimes would be dominated by the seed search. We 
are working on using suffix trees to re-implement 
MEME's search for seeds more efficiently; however, this 
is a major undertaking in its own right. We have included 
an implementation of our work-in-progress with the 
source code for STEME. It is of practical value for 
motifs of up to width 8 on large data sets (over 500 Kb); 
however, the efficiencies tail off quickly as the motif width 
increases (Table 2). For example, on the 674 Kb SRF 
data set, MEME took over 4h to find a motif of width 
8. In contrast, our implementation with STEME finished 
in 13min, 18 times quicker. 

In addition, the way that MEME calculates the signifi- 
cance of the motifs involves a preprocessing step that does 
not scale well to large numbers of sites. Typically, a user 
will want to choose the number of sites proportionally to 
the number of sequences in the data set. Hence for 
large data sets, the significance calculation needs to be 
re-implemented more efficiently. We are working on this 
using approximations to the LLR P-value calculations. 

CONCLUSION 

Reverse engineering transcriptional networks remains an 
important in silico challenge. Modern biology continues to 
generate ever larger data sets and this trend can be 
expected to continue. Hence there exists a need for good 
motif finders that can handle large data sets. MEME is 
well trusted but does not handle these data sets well. We 
have presented an approximation to EM for models of the 
type used in the MEME algorithm. We have demonstrated 
that this approximation has a minor effect on the outcome 
on the algorithm and is an order of magnitude faster. We 
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have supplied an implementation of this algorithm and 
hope that it will be incorporated into existing and novel 
motif finders. 
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